Winner’s Report: KDD CUP Breast Cancer Identification
Abstract
We describe the ideas and methodologies that we developed in addressing the KDD Cup 2008 challenge on early breast cancer detection, and discuss how they contributed to our success.

1. TASK AND DATA DESCRIPTION

The Siemens KDD Cup 2008 comprised two prediction tasks in breast cancer detection from images. The organizers provided data from 1712 patients for training; of these, 118 had cancer. Within the four breast images (two views of each breast per patient) are suspect locations (called candidates), which are described by their coordinates and 117 features. No explanation of the features was given to the competition participants. Overall, the training set includes 102,294 candidates, 623 of which are positive. A second dataset with similar properties was used as the test set for competition evaluation.

The two modeling tasks were:

Task 1: Rank the candidates in decreasing order of their likelihood of being cancerous. The evaluation criterion for this task was the FROC score, which measures how many of the actual patients with cancer are identified while limiting the number of candidate false alarms to between 0.2 and 0.3 per image. This was meant as a realistic representation of the setting in which the prediction model is used as a decision support tool for radiologists.

Task 2: Suggest a maximal list of test-set patients who are healthy. In this competition, including any patient with cancer in the list disqualified the entry. This was meant to represent a scenario where the model is used to save the radiologist work by ruling out patients who are definitely healthy, and thus the model was required to have no false negatives.

Several aspects of the data and the tasks made this competition interesting, including:

• The presence of leakage, whereby patient IDs turned out to carry significant information about a patient’s likelihood of being malignant.
The issue of leakage in general is widespread in competitions, and also in real-life efforts. We discuss this competition’s example and others in Section 2.
• Unique data properties, including the presence of extreme outliers and the combination of the features with neighborhood-based information from the location of candidates. These properties, and our efforts to alleviate and exploit them, are discussed in Section 3.
• The unique FROC score, which treats patients as positive examples, but candidates as negative examples. This clearly has implications for the way in which models should rank candidates: it favors combining candidates from different patients over listing many good candidates from the same patient. We address this in the context of post-processing schemes for model scores in Section 4.

We discuss our final submitted models in Section 5.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGKDD 2008 Las Vegas, NV USA Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.

2. LEAKAGE IN PATIENT ID

Leakage can be defined as the introduction of predictive information about the target by the data generation, collection, and preparation process. Such information leakage, while potentially highly predictive out-of-sample within the study, leads to limited generalization and model applicability, or to overestimation of the model performance. Two of the most common causes of leakage are:

1. Combination of data from multiple sources and/or multiple time points, followed by a failure to completely anonymize the data and hide the different sources.
2. 
Accidental creation of artificial dependencies and additional information while preparing the data for the competition or proof-of-concept.

This year’s KDD Cup data suffered from leakage that was probably due to the first cause. The patient IDs in the competition data carried significant information towards identifying malignant patients. This is best illustrated through a discretization of the patient ID range in the training data, as demonstrated in Figure 1. The patient IDs are naturally divided into three disjoint bins: between 0 and 20,000 (254 patients; 36% malignant); between 100,000 and 500,000 (414 patients; 1% malignant); and above 4,000,000 (1044 patients; 1.7% malignant). We can further observe that all 18 afflicted patients in the last bin have patient IDs in the range 4,000,000 to 4,870,000, and that there are only 3 healthy patients in this range. This gives us a four-bin division of the data with great power to identify malignant patients. This binning and its correlation with the patient’s state generalized perfectly to the test data as well.

Our hypothesis is that this leakage reflects the compilation of the competition data from different medical institutions, and possibly different equipment, where the identity of the source is reflected in the ID range and is highly informative of the patient’s outcome. For example, one source might be a preventive care institution with a very low base rate of malignant patients, and another could be a treatment-oriented institution with much higher cancer prevalence.

Figure 1: Distribution of malignant (black) and benign (gray) candidates depending on patient ID, shown on the X-axis in log scale. The Y-axis is the score of a linear SVM model on the 117 features. The vertical lines show the boundaries of the identified ID bins.
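The four-bin division described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the bin edges are taken from the ID ranges quoted in the text, treating each upper edge as inclusive is our own choice, the helper names id_bin and malignancy_rate_by_bin are hypothetical, and the toy records are made up rather than drawn from the competition data.

```python
from bisect import bisect_left
from collections import defaultdict

# Upper edges of the first three ID bins, from the ranges quoted in the text.
# Treating each edge as inclusive is an assumption of this sketch.
BIN_UPPER_EDGES = [20_000, 500_000, 4_870_000]


def id_bin(patient_id):
    """Return the 0-based bin index (0..3) for a patient ID."""
    return bisect_left(BIN_UPPER_EDGES, patient_id)


def malignancy_rate_by_bin(patients):
    """Summarize (patient_id, is_malignant) pairs per ID bin.

    Returns {bin_index: (n_patients, n_malignant, rate)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for patient_id, is_malignant in patients:
        entry = counts[id_bin(patient_id)]
        entry[0] += 1
        entry[1] += int(is_malignant)
    return {b: (n, m, m / n) for b, (n, m) in sorted(counts.items())}


# Made-up toy records for illustration only (NOT competition data).
toy_patients = [
    (1_500, True), (12_000, False), (250_000, False),
    (4_200_000, True), (4_900_000, False), (5_100_000, False),
]
summary = malignancy_rate_by_bin(toy_patients)
for b, (n, m, rate) in summary.items():
    print(f"bin {b}: {n} patients, {m} malignant, rate {rate:.2f}")
```

Applied to the real training set, a per-bin summary of this kind would be expected to show the strongly differing malignancy rates reported above, which is what makes the patient ID informative.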
While it is clear that such leakage does not represent a useful pattern for real application, we consider its discovery and analysis an integral and important part of successful data analysis. Furthermore, we suggest that the problem in this dataset may actually run deeper: all participants may have unknowingly benefited somewhat from a leakage problem, and all reported performances are likely to be inflated. If the predictiveness of the identifiers was indeed caused by the combination of data from different sources, there may be additional implicit leakages due to differences in data collection settings (e.g., machine calibration). These would still be present even if the patient IDs had been removed.

We test this hypothesis with the following experiment: if such a leakage exists (say, the average grayscale is slightly different), it should be possible to predict the data source (i.e., one of the four identifier bins) from negative candidates only. We cannot include positives because we already know that the cancer prevalence is correlated with the bins. Our analysis shows that both group 1 (ID below 20,000) and group 4 (ID above 4,870,000) are easily identified by a logistic model from the 117 provided features, with AUCs of 0.865 and 0.75 respectively. Given this result, we are confident in concluding that any reasonable model can infer the patient group to some extent from the 117 variables, and thereby implicitly the cancer prevalence in that patient population. All models built on this dataset are therefore likely to overestimate the true predictive performance of cancer detection when applied to an entirely different population.
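The logic of this experiment can be illustrated on synthetic data. The sketch below is not the authors' analysis: it assumes three made-up features instead of the 117 provided ones, a hypothetical calibration offset added to one feature for one source, and a hand-written gradient-descent logistic regression in place of whatever logistic model the authors used. All function names are our own. It only shows how a source can be predicted from negative candidates and scored by AUC.

```python
import math
import random


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Batch gradient descent for logistic regression; returns (weights, bias)."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


def predict(X, w, b):
    """Linear scores; monotone in the predicted probability, so fine for AUC."""
    return [b + sum(wj * xj for wj, xj in zip(w, xi)) for xi in X]


# Synthetic negative candidates from four hypothetical sources.
# Source 0 gets a small offset on feature 0 (a stand-in for calibration drift).
rng = random.Random(0)
X, y = [], []
for _ in range(600):
    source = rng.randrange(4)
    x = [rng.gauss(0.0, 1.0) for _ in range(3)]
    if source == 0:
        x[0] += 1.0
    X.append(x)
    y.append(1 if source == 0 else 0)  # one-vs-rest target: "is source 0?"

w, b = fit_logistic(X[:400], y[:400])
test_auc = auc(predict(X[400:], w, b), y[400:])
print(f"held-out AUC for identifying source 0: {test_auc:.3f}")
```

A held-out AUC clearly above 0.5 in such a setup indicates that the features alone carry information about the source, which is the kind of evidence the experiment above relies on.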
More generally, experience has shown that leakages play a central role in many modeling competitions, including KDD Cup 2007 [4], where the organizers’ preparation of the data for one task exposed some information about the response for the other task, and KDD Cup 2000 [1], where internal testing patterns that were left in the data by the organizers supplied a significant boost to those who were able to identify them.

Exploratory data analysis seems to have become something of a lost art in the KDD community. In proper exploratory analysis, the modeler carefully and exhaustively examines the data with little preconception about what it contains, allows patterns and phenomena to present themselves, and only then analyzes them and questions their origin and validity. A careful exploratory analysis of the data for this competition would most likely have identified this leakage. We hope that our discovery of this leakage can serve as a reminder of the value of open-minded exploratory
Publication date: 2008